We analyze variables associated with money earned, average score, and wins on the PGA Tour during the 2018 season to uncover the relationships among them, along with their effect sizes and statistical significance.
The distribution of money is skewed to the right. A log transform of that variable is needed here.
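To illustrate why a log transform helps with a right-skewed variable, here is a minimal sketch on simulated earnings. The simulated values, the seed, and the helper `skewness_of` are assumptions made for illustration; none of them come from the PGA Tour data.

```r
# Simulated right-skewed "earnings" (log-normal); NOT the 2018 PGA Tour data.
set.seed(1)
money_sim <- rlnorm(200, meanlog = 14, sdlog = 1)

# Hypothetical helper: sample skewness as the mean cubed standardized value.
skewness_of <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness_of(money_sim)       # positive for right-skewed data
skew_log <- skewness_of(log(money_sim))  # much closer to zero

print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero after the transform suggests the logged variable is closer to symmetric, which is the motivation for modeling log earnings.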
Overall, there doesn’t seem to be much of a relationship between rounds played and money earned. However, if we ignore the players who won at least one tournament and look only at the red points, we do see a positive relationship.
Players who are winning simply make more money on average. The mean earnings of PGA Tour players who won at least once were 3,979,965 dollars, compared with 1,285,266 dollars for those who didn’t win.
The relationship between average score and money is clearly non-linear and negative. This makes sense: the higher a player’s average score, the worse their tournament finishes, and so the less money they make.
As driving distance increases, there is an increase in money made. But the question is whether players are making more money because they hit the ball further, or because they are shooting lower scores thanks to an advantage off the tee.
Here we do see a negative relationship between average driving distance and average score, suggesting that hitting the ball further may offer a scoring advantage.
Average scrambling is the percentage of the time a player misses the green but still makes par or better, so essentially this variable represents a player’s short game. Here we see that an increase in average scrambling doesn’t have a big effect on money made.
However, average scrambling does have a big effect on average score, which in turn influences how much money a player makes. What is really going on is that most of the variables have a strong relationship with average score, and through it they influence how much money a player makes. So we decided to look into the variables that break down a golfer’s game, such as driving, short game, and putting, and to use the player’s average score as the response.
The distribution of average score is approximately normal with a mean of 70.9 and standard deviation of 0.77.
Here are the remaining variables and their relationship with average scores.
| Driving distance | Won: No | Won: Yes |
|---|---|---|
| short | 0.90 | 0.10 |
| average | 0.84 | 0.16 |
| long | 0.71 | 0.29 |
Chi-Squared Test
\[ H_0: \text{Winning at least one tournament is independent of driving distance} \\ vs \\ H_A: \text{Winning at least one tournament depends on driving distance} \]
\[ \chi^2 = \sum\frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}} \]
\[ \chi^2 \; | \; H_0 \sim \chi^2_{df = 2} \]
\[ \chi^2 \; = 6.438 \]
\[ \text{Due to the small p value (0.040), we reject the null hypothesis and conclude} \\ \text{that winning at least one tournament depends on driving distance} \]
##
## Pearson's Chi-squared test
##
## data: golf_df$won_that_year and golf_df$driving_dist_cat
## X-squared = 6.438, df = 2, p-value = 0.03999
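As a quick check on the output above, the p-value can be recovered from the reported statistic alone. This is a minimal sketch that uses only the numbers printed in this report; the variable names are illustrative.

```r
# Test statistic as reported in the output above.
chi_sq <- 6.438

# Degrees of freedom: (3 distance categories - 1) * (2 win outcomes - 1).
df <- (3 - 1) * (2 - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(round(p_value, 5))
```

The result should agree with the p-value printed by the test, up to rounding of the statistic.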
Model With All Variables:
\[ \mathrm{Average.Score}_i = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_2 \mathrm{FairwayPerc_i} + \beta_3 \mathrm{GIR_i} + \beta_4 \mathrm{Avg.Putts_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{10} \mathrm{Dist.Avg_i} + \beta_{11} \mathrm{Dist.Long_i} + \beta_{12} \mathrm{Won_i} + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]
Variance Inflation Factors
## GVIF Df GVIF^(1/(2*Df))
## Rounds 1.177485 1 1.085120
## Fairway.Percentage 2.550392 1 1.596995
## gir 5.513885 1 2.348166
## Average.Putts 5.473769 1 2.339609
## Average.Scrambling 2.948572 1 1.717141
## Average.SG.Putts 2.454406 1 1.566654
## SG.OTT 3.998934 1 1.999733
## SG.APR 2.122350 1 1.456829
## SG.ARG 1.977045 1 1.406074
## driving_dist_cat 3.482365 2 1.366056
## won_that_year 1.189684 1 1.090726
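The last column of the table rescales each generalized VIF by its degrees of freedom, so that the multi-level driving distance factor is comparable to the single-coefficient terms. A minimal sketch of that calculation, using three rows copied from the table (the vector names are illustrative):

```r
# GVIF values and degrees of freedom copied from the table above.
gvif <- c(gir = 5.513885, Average.Putts = 5.473769, driving_dist_cat = 3.482365)
df   <- c(gir = 1,        Average.Putts = 1,        driving_dist_cat = 2)

# Adjusted value shown in the last column: GVIF^(1 / (2 * Df)).
adjusted <- gvif^(1 / (2 * df))
print(round(adjusted, 6))

# Terms whose GVIF exceeds a cutoff of 5.
print(names(gvif)[gvif > 5])
```

For terms with one degree of freedom the adjusted value is simply the square root of the VIF.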
Greens in regulation (gir) and average putts both have variance inflation factors above 5. After removing gir, the VIF for average putts fell below the threshold of 5. We then performed backward stepwise selection and ended up with the following model.
Model From Backwards Selection Via AIC Criterion:
\[ \mathrm{Average.Score}_i = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{12} \mathrm{Won_i} + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]
##
## Call:
## lm(formula = Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts +
## SG.OTT + SG.APR + SG.ARG + won_that_year, data = golf_df_slim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59703 -0.12928 0.00706 0.12552 0.62939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.298783 0.385017 187.781 < 2e-16 ***
## Rounds -0.003058 0.001120 -2.730 0.006956 **
## Average.Scrambling -0.016681 0.006496 -2.568 0.011021 *
## Average.SG.Putts -0.944199 0.058233 -16.214 < 2e-16 ***
## SG.OTT -0.981893 0.044361 -22.134 < 2e-16 ***
## SG.APR -0.987245 0.048714 -20.266 < 2e-16 ***
## SG.ARG -0.824935 0.091928 -8.974 3.29e-16 ***
## won_that_yearYes -0.142803 0.041725 -3.423 0.000765 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2075 on 184 degrees of freedom
## Multiple R-squared: 0.9306, Adjusted R-squared: 0.9279
## F-statistic: 352.2 on 7 and 184 DF, p-value: < 2.2e-16
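The backward selection step can be sketched with base R's `step()`. The golf data frame is not reproduced here, so this example uses the built-in `mtcars` data as a stand-in; the formula and variable choices are assumptions made only to show the workflow.

```r
# Stand-in data: built-in mtcars, used only to show the workflow.
full_fit <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Backward elimination: drop one term at a time while AIC keeps improving.
reduced_fit <- step(full_fit, direction = "backward", trace = 0)

print(formula(reduced_fit))
print(c(full = AIC(full_fit), reduced = AIC(reduced_fit)))
```

The selected model never has a higher AIC than the starting model, since `step()` only removes a term when doing so lowers the criterion.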
This model includes a variable for each part of a golfer’s game (driving, approach play, chipping, and putting) while also controlling for rounds played and whether a player won at least one tournament.
\[
\textbf{Quality of fit: } \text{Our regression model, including 7 golf characteristics, accounts for 93% of the variability in average scores,} \\
\text{and the associated residual standard error is 0.208, implying that our predictions typically miss the observed average scores by about 0.208 strokes}
\]
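The fit statistics can be rebuilt from quantities printed elsewhere in this report. A minimal sketch follows; the residual sum of squares is taken from the analysis of variance table for this model, and the variable names are illustrative.

```r
# Quantities taken from the regression and ANOVA output in this report.
rss <- 7.9220        # residual sum of squares of the selected model
n <- 192             # 184 residual df + 8 estimated coefficients
p <- 7               # number of predictors
r_squared <- 0.9306

# Residual standard error: typical size of a residual, in strokes.
rse <- sqrt(rss / (n - (p + 1)))

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - (p + 1))

print(round(c(rse = rse, adj_r_squared = adj_r_squared), 4))
```

Small differences in the last digit are expected because the inputs are rounded.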
Slope Interpretations:
\[ \textbf{Average SG Putts: } \text{Per stroke increase in average strokes gained putting, holding rounds, average scrambling,} \\ \text{strokes gained approaching the green, strokes gained around the green, strokes gained off the tee, and winning at} \\ \text{least one tournament in 2018 constant, on average we expect a 0.944 decrease in average score.} \]
\[ \textbf{SG OTT: } \text{Per stroke increase in strokes gained off the tee, holding rounds, average scrambling,} \\ \text{strokes gained approaching the green, strokes gained around the green, average strokes gained putting, and winning at} \\ \text{least one tournament in 2018 constant, on average we expect a 0.982 decrease in average score.} \]
\[ \textbf{SG APR: } \text{Per stroke increase in strokes gained approaching the green, holding rounds, average scrambling,} \\ \text{strokes gained around the green, average strokes gained putting, strokes gained off the tee, and winning at} \\ \text{least one PGA tournament in 2018 constant, on average we expect a 0.987 decrease in average score.} \]
Testing for Model Significance:
\[
H_0: \beta_1 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = \beta_{12} = 0 \\ vs \\
H_A: \text{at least one } \beta_j \neq 0 \\
\]
\[ \mathrm{FS} = \frac{\mathrm{RegSS} / p}{\mathrm{RSS}/(n-(p+1))} \; | \; H_0 \sim F_{p, \; n-(p+1)} \\ \]
\[ \mathrm{FS} = 352.2\\ \]
\[ \mathrm{p \;value} < 2.2 \times 10^{-16} \\ \]
\[ \text{Based on the tiny p value, we reject the null hypothesis and conclude} \\ \text{the model is statistically significant} \]
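Because RegSS / RSS equals R-squared / (1 - R-squared), the overall F statistic can be approximated from the reported R-squared. This is a minimal sketch using only numbers from the model summary; the variable names are illustrative.

```r
# Values reported in the model summary.
r_squared <- 0.9306
p <- 7      # predictors
n <- 192    # observations (184 residual df + 8 coefficients)

# F statistic rebuilt from R-squared; it differs slightly from 352.2
# because R-squared is rounded to four decimals.
f_from_r2 <- (r_squared / p) / ((1 - r_squared) / (n - (p + 1)))

# Upper-tail p-value for the reported statistic.
p_value <- pf(352.2, df1 = p, df2 = n - (p + 1), lower.tail = FALSE)

print(round(f_from_r2, 1))
print(p_value < 2.2e-16)
```

R prints `< 2.2e-16` whenever a p-value falls below that display limit, so the reported figure is an upper bound rather than an exact value.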
Confidence Interval Interpretations:
\[ \textbf{Average SG Putting: } \text{We are 95% confident that per stroke increase in average strokes gained putting, holding rounds,} \\ \text{average scrambling, strokes gained approaching the green, strokes gained around the green, strokes gained off the tee, and winning at} \\ \text{least one tournament in 2018 constant, we expect average score to decrease by between 0.83 and 1.059 strokes, on average.} \]
\[ \textbf{SG OTT: } \text{We are 95% confident that per stroke increase in strokes gained off the tee, holding rounds,} \\ \text{average scrambling, strokes gained approaching the green, strokes gained around the green, average strokes gained putting, and winning at} \\ \text{least one tournament in 2018 constant, we expect average score to decrease by between 0.89 and 1.069 strokes, on average.} \]
\[ \textbf{SG APR: } \text{We are 95% confident that per stroke increase in strokes gained approaching the green, holding rounds,} \\ \text{average scrambling, strokes gained around the green, average strokes gained putting, strokes gained off the tee, and winning at} \\ \text{least one PGA tournament in 2018 constant, we expect average score to decrease by between 0.891 and 1.083 strokes, on average.} \]
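Intervals of this kind follow from the coefficient table as estimate plus or minus a t quantile times the standard error. A minimal sketch, assuming the usual t-based interval with 184 residual degrees of freedom (the report does not show the call that produced its intervals):

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(Average.SG.Putts = -0.944199, SG.OTT = -0.981893,
         SG.APR = -0.987245, SG.ARG = -0.824935)
se  <- c(Average.SG.Putts = 0.058233, SG.OTT = 0.044361,
         SG.APR = 0.048714, SG.ARG = 0.091928)

# 95% interval: estimate +/- t quantile (184 residual df) * standard error.
t_crit <- qt(0.975, df = 184)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(ci, 3))
```

With a fitted model object available, `confint()` on the `lm` fit gives the same intervals directly.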
There don’t seem to be any discernible patterns in the residuals, so the constant-variance assumption is satisfied here.
Normality of error terms is also satisfied.
There does seem to be a possible interaction between rounds played and whether a player won at least one tournament that year. It makes sense that for players who are winning tournaments, playing more rounds doesn’t have a big effect on average score, because they are already playing well enough to win. Players who aren’t winning, however, seem to play better the more they play, suggesting a negative relationship between rounds played and average score.
Full Modeling Equation With Interaction Term:
\[ \mathrm{Average.Score}_i = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{12} \mathrm{Won_i} + \beta_{13} (\mathrm{Rounds_i} \times \mathrm{Won_i}) + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]
Incremental F-Test to test significance of interaction:
\[ H_0: \; \beta_{13} = 0 \\ vs \\ H_A: \beta_{13} \neq 0 \\ \]
\[ \mathrm{FS} = \frac{(\mathrm{RegSS}_F - \mathrm{RegSS}_N) / q}{\mathrm{RSS}_F/(n-(p+1))} \; | \; H_0 \sim F_{q, \; n-(p+1)} \\ \]
\[ \mathrm{FS} = 17.715 \]
\[ \mathrm{p \;value = 4.017e-05} \\ \]
\[ \text{Based on the tiny p value, we reject the null hypothesis and conclude} \\ \text{the interaction term is statistically significant} \]
## Analysis of Variance Table
##
## Model 1: Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts +
## SG.OTT + SG.APR + SG.ARG + won_that_year
## Model 2: Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts +
## SG.OTT + SG.APR + SG.ARG + won_that_year + Rounds * won_that_year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 184 7.9220
## 2 183 7.2228 1 0.69919 17.715 4.017e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
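The incremental F statistic can be recomputed from the two residual sums of squares in the table, since the gain in regression sum of squares equals the drop in residual sum of squares. A minimal sketch using only the printed values (variable names are illustrative):

```r
# Residual sums of squares and residual df from the ANOVA table.
rss_null <- 7.9220   # model without the interaction
rss_full <- 7.2228   # model with the Rounds x Won interaction
df_full  <- 183
q        <- 1        # number of parameters being tested

# Incremental F: drop in RSS per tested parameter over the full-model MSE.
f_stat  <- ((rss_null - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(round(f_stat, 3))
print(signif(p_value, 3))
```

Both values should match the table up to rounding of the printed sums of squares.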
Regression Outliers
Leverage of Outliers
We went ahead and removed points 28, 39 and 183 and fit the model again.
- R-Squared with outliers: 0.9367
- R-Squared without outliers: 0.936
- RSE with outliers: 0.1987
- RSE without outliers: 0.193
Due to the very small differences in the model, we decided not to remove the outliers.
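The outlier check can be sketched as follows. The golf data is not reproduced here, so the built-in `mtcars` data stands in, and the cutoffs (absolute studentized residual above 2, leverage above twice the average hat value) are common rules of thumb assumed for illustration, not necessarily the ones used in this report.

```r
# Stand-in data: built-in mtcars, not the golf data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Studentized residuals flag regression outliers; hat values measure leverage.
stud_res <- rstudent(fit)
leverage <- hatvalues(fit)

# Rule-of-thumb cutoffs (assumptions, not taken from the report).
p <- length(coef(fit)) - 1
n <- nrow(mtcars)
flagged <- which(abs(stud_res) > 2 | leverage > 2 * (p + 1) / n)

# Refit without the flagged rows (if any) and compare fit statistics.
trimmed <- if (length(flagged) > 0) mtcars[-flagged, ] else mtcars
fit_trim <- lm(mpg ~ wt + hp, data = trimmed)

comparison <- data.frame(
  model = c("all points", "flagged removed"),
  r_squared = c(summary(fit)$r.squared, summary(fit_trim)$r.squared),
  rse = c(sigma(fit), sigma(fit_trim))
)
print(comparison)
```

If the two rows of the comparison are nearly identical, keeping the flagged points is a defensible choice.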
For the logistic regression of winning at least one tournament, we again checked for collinearity and performed variable selection via backward AIC, which gave the following model:
Full Modeling Equation
\[ p_i = P(\mathrm{Won}_i = 1 \; | \; \mathrm{Average.SG.Putts_i}, \; \mathrm{SG.OTT_i}, \; \mathrm{SG.APR_i}) \\ \]
\[ \mathrm{Won}_i \sim \mathrm{ind} \; \mathrm{Bin}(1, p_i) \]
\[ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \; \mathrm{Average.SG.Putts_i} + \beta_2 \; \mathrm{SG.OTT_i} + \beta_3 \; \mathrm{SG.APR_i} \]
##
## Call:
## glm(formula = won_that_year ~ Average.SG.Putts + SG.OTT + SG.APR,
## family = "binomial", data = log_regres_golf_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2189 -0.6195 -0.4740 -0.2493 2.7917
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0857 0.2726 -7.650 2.01e-14 ***
## Average.SG.Putts 1.5962 0.7044 2.266 0.02344 *
## SG.OTT 1.8500 0.6861 2.696 0.00701 **
## SG.APR 1.4630 0.6473 2.260 0.02382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 179.31 on 191 degrees of freedom
## Residual deviance: 156.70 on 188 degrees of freedom
## AIC: 164.7
##
## Number of Fisher Scoring iterations: 5
Because the outcome is heavily imbalanced, with many more players who didn’t win a tournament than players who did, the model’s predictions are skewed as well.
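To see how the fitted coefficients translate into probabilities, here is a minimal sketch that applies the inverse logit to the estimates in the output above. The helper `win_prob` and the two example players are hypothetical; a strokes-gained value of 0 is treated as roughly field average.

```r
# Coefficients copied from the logistic regression output.
beta <- c(intercept = -2.0857, putts = 1.5962, ott = 1.8500, apr = 1.4630)

# Hypothetical helper: predicted win probability for given strokes-gained values.
win_prob <- function(putts, ott, apr) {
  eta <- beta[["intercept"]] + beta[["putts"]] * putts +
    beta[["ott"]] * ott + beta[["apr"]] * apr
  plogis(eta)  # inverse logit
}

p_average <- win_prob(0, 0, 0)        # strokes gained of 0 in all three areas
p_strong  <- win_prob(0.5, 0.5, 0.5)  # gains half a stroke in each area

print(round(c(average = p_average, strong = p_strong), 3))
```

The low baseline probability reflects how few players in the data won a tournament.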
\[ H_0: \beta_1 = \beta_2 = \beta_3 = 0 \\ vs \\ H_A: \text{at least one } \beta_j \neq 0 \\ \]
\[ \mathrm{p \; value: 4.89e-05} \\ \]
\[ \text{Based on the tiny p value when comparing the full model to the null model, we can reject } \\ \text{the null hypothesis and claim the model is statistically significant} \]
## Analysis of Deviance Table
##
## Model 1: won_that_year ~ 1
## Model 2: won_that_year ~ Average.SG.Putts + SG.OTT + SG.APR
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 191 179.31
## 2 188 156.71 3 22.601 4.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
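The test in this table is a likelihood ratio test: the drop in deviance is compared to a chi-squared distribution with degrees of freedom equal to the number of added coefficients. A minimal sketch from the printed values (variable names are illustrative):

```r
# Deviances and residual df from the analysis of deviance table.
null_dev  <- 179.31
resid_dev <- 156.71
df_diff   <- 191 - 188

# Likelihood ratio statistic: drop in deviance, compared to chi-squared.
lr_stat <- null_dev - resid_dev
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(round(lr_stat, 2))
print(signif(p_value, 3))
```

The result should match the table up to rounding of the printed deviances.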
Slope Interpretations:
\[
\textbf{Probability: } \text{As average strokes gained putting increases, holding strokes gained off the tee} \\ \text{and strokes gained approaching the green constant, the probability of winning also increases.}
\]
\[ \textbf{Log-odds: }\text{Per stroke increase in average strokes gained putting, holding strokes gained off the tee } \\ \text{and strokes gained approaching the green constant, the log-odds of winning increase by 1.596} \]
\[ \textbf{Odds: } \text{Per stroke increase in average strokes gained putting, holding strokes gained off the tee} \\ \text{and strokes gained approaching the green constant, the odds of winning are multiplied by 4.93.} \]
\[ \textbf{Odds: } \text{Per stroke increase in strokes gained off the tee, holding average strokes gained putting } \\ \text{and strokes gained approaching the green constant, the odds of winning are multiplied by 6.36.} \]
\[ \textbf{Odds: } \text{Per stroke increase in strokes gained approaching the green, holding average strokes gained putting } \\ \text{and strokes gained off the tee constant, the odds of winning are multiplied by 4.32.} \]
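The odds multipliers come from exponentiating the log-odds slopes. A minimal sketch using the coefficients from the logistic regression output (the vector name is illustrative):

```r
# Log-odds coefficients from the logistic regression output.
log_odds <- c(Average.SG.Putts = 1.5962, SG.OTT = 1.8500, SG.APR = 1.4630)

# Exponentiating a log-odds slope gives the multiplicative change in odds
# for a one-stroke increase, holding the other predictors constant.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 2))
```

With the fitted model object available, `exp(coef(fit))` gives the same quantities without retyping the estimates.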
Confidence Interval Interpretations:
\[ \textbf{Average SG Putting: } \text{With 95% confidence, we expect that for a one stroke increase in average strokes gained putting, holding strokes gained} \\ \text{off the tee and strokes gained approaching the green constant, the odds of winning at least one PGA tournament are multiplied by a factor between 1.28 and 20.58.} \]
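The report does not say how this interval was computed. One common approximation is a Wald interval, built on the log-odds scale and then exponentiated. The sketch below uses that approach as an assumption; its endpoints need not match the interval quoted above, for instance if that one came from a profile-likelihood method.

```r
# Estimate and standard error for Average.SG.Putts from the glm output.
est <- 1.5962
se  <- 0.7044

# Wald 95% interval on the log-odds scale, then exponentiated to odds.
z_crit  <- qnorm(0.975)
wald_ci <- exp(est + c(lower = -1, upper = 1) * z_crit * se)

print(round(wald_ci, 2))
```

Either way the interval is wide, which reflects the large standard error on the putting coefficient.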